Assessing agreement between different polygenic risk scores in the UK Biobank

Polygenic risk scores (PRS) have been proposed for risk stratification in clinical and research settings. However, few studies have investigated how different PRS diverge from each other in predicting risk for individuals. We compared two recently published PRS for each of three conditions (breast cancer, hypertension and dementia) to assess the stability of these algorithms for risk prediction in a single large population. We used imputed genotyping data from the UK Biobank prospective cohort, limited to the White British subset. We found that: (1) For all three diseases, 20% or more of the SNPs in the first PRS were not represented in the more recent PRS, either by the same SNP or by a surrogate in linkage disequilibrium (LD) with R² > 0.8. (2) Although the difference in the area under the receiver operating characteristic curve (AUC) between the two PRS was hardly appreciable for all three diseases, there were large differences in individual risk prediction between the two PRS. For instance, for each disease, of those classified in the top 5% of risk by the first PRS, over 60% were not so classified by the second PRS. We found substantial discordance between different PRS for the same disease, indicating that individuals could receive different medical advice depending on which PRS is used to assess their genetic susceptibility. It is desirable to resolve this uncertainty before using PRS for risk stratification in clinical settings.

www.nature.com/scientificreports/
selected by the penalised regression "lassosum" and the highest pseudo-R². However, these PRS are typically compared at a population level using metrics such as the AUC or OR, and limited attention has been paid to how they differ from each other in predicting risk for individuals. It is therefore important to understand how the use of different PRS affects an individual's classification of risk for future disease, as this has important implications for the use of PRS in wider practice.

Materials and methods
Study populations. We used data from the UK Biobank (UKB), a large-scale population-based prospective cohort study of approximately 500,000 individuals aged 40-69 years at recruitment, enrolled across the United Kingdom between March 2006 and October 2010. The full details of the genotyping and imputation are described elsewhere 14,15.
Our study populations for each of the three disease outcomes are defined as follows:
• The breast cancer eligible population was women who had not had breast cancer, carcinoma in situ or mastectomy prior to baseline.
• For the hypertension eligible population, we excluded individuals with missing or implausible systolic blood pressure (SBP) measurements (< 70 or > 270 mmHg) at baseline, and those with Major Adverse Cardiovascular Events (MACE) prior to baseline.
• The dementia eligible population was restricted to individuals without a diagnosis of dementia prior to baseline.
We further restricted to genetically White British individuals (UKB Data Field 22006), and excluded individuals who were related (3rd degree or higher), sex discordant, or outliers for genotype missingness or heterozygosity based on UK Biobank-derived sample quality control data (UKB Data Field 22020). This yields the final size N of each study population.
Disease ascertainment in UKB during the follow-up period utilised linkage to the death registry, the cancer registry, and Hospital Episode Statistics (HES). Hypertension was defined as SBP ≥ 140 mmHg at baseline; the International Classification of Diseases (ICD) codes for breast cancer and dementia can be found in Supplementary Tables 14-16.

Selection of PRS.
For each of the three disease outcomes, we selected a pair of recently published PRS to compare, typically published within two years of each other. The earlier PRS is denoted as PRS-A, while the more recent one is PRS-B.
When choosing PRS we identified scores that were derived using the same trait definition in primarily White European populations, to be appropriate for use in the UK Biobank, and that had been derived in large consortia for their respective disease. Where possible, we selected scores listed in the Polygenic Score Catalog (PGS Catalog), an online database that collects and curates PRS from across the literature and makes metadata available in a standardised way. Table 1 summarises the characteristics of the chosen PRS for each trait, including the construction method. Supplementary Table 1 contains further information about each PRS, including source of weights (i.e. derivation dataset), population characteristics, and validation dataset.
• For breast cancer, PRS-A (313 SNPs, PGS ID: PGS000004) 12 has been widely validated and is included in the current implementation of the BOADICEA breast cancer risk model 16,17. For PRS-B, we used a score (118,388 SNPs, PGS ID: PGS000511) 13 21 and PRS-B (39 SNPs, PGS ID: PGS001775) 22 used effect sizes from the International Genomics of Alzheimer's Project (IGAP) GWAS 23. PRS-A was constructed in January 2021 while PRS-B was developed in September of the same year. We selected PRS that did not contain the two APOE SNPs (rs429358 and rs7412), in order to prevent the APOE genotype from dominating the PRS.
We examined the overlap in SNPs between PRS-A and PRS-B for each disease, including those in high linkage disequilibrium (LD) (R² > 0.8).
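As a minimal sketch of how such an overlap could be counted, the following assumes that pairwise LD R² values have already been computed and stored in a lookup table. The function name, the `ld_r2` dictionary and all rsIDs are hypothetical and serve only as an illustration; they are not taken from the study's analysis script.

```python
def represented_fraction(snps_a, snps_b, ld_r2, threshold=0.8):
    """Fraction of SNPs in score A that appear in score B, either directly
    or through a SNP in B whose LD R^2 with it exceeds `threshold`."""
    snps_b = set(snps_b)
    represented = 0
    for snp in snps_a:
        if snp in snps_b:
            represented += 1
            continue
        # Look for any SNP in B that is in high LD with this SNP.
        for other in snps_b:
            if ld_r2.get(frozenset((snp, other)), 0.0) > threshold:
                represented += 1
                break
    return represented / len(snps_a)


# Hypothetical example: rs1 is shared, rs2 has a proxy (rs5), rs3 and rs4 do not.
prs_a = ["rs1", "rs2", "rs3", "rs4"]
prs_b = ["rs1", "rs5", "rs6"]
ld = {frozenset(("rs2", "rs5")): 0.93, frozenset(("rs3", "rs6")): 0.41}
fraction = represented_fraction(prs_a, prs_b, ld)
```

Missing pairs are treated as having no LD, which is an assumption of this sketch rather than a property of any particular LD reference panel.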
We took care to avoid sample overlap, a potential pitfall for PRS 24. Since we planned to calculate PRS in the UKB population (i.e. the target cohort), we preferred PRS that were not derived in the UKB population. During the PRS selection stage, we examined the base GWAS cohorts used for PRS derivation and attempted to ensure that they did not contain our target cohort (i.e. UKB).
We were able to identify such PRS for breast cancer and dementia. To the best of our knowledge, however, no such score exists for PRS-A for SBP in the recent literature. We investigated all the available PRS for SBP in the PGS Catalog, and found that UKB was present in all the derivation populations. Given the extensive blood pressure measures and genetic data in the UKB, it is unsurprising that researchers would include UKB in their derivation population for PRS.

Calculating PRS. We computed the PRS of an individual j as the weighted sum of trait-associated SNPs,
$$\mathrm{PRS}_j = \sum_{i=1}^{N} \beta_i \times \mathrm{dosage}_{ij},$$
where N is the total number of SNPs, β_i is the effect size (or beta) of SNP i, and dosage_ij is the number of effect alleles of SNP i carried by individual j (usually encoded as 0, 1 or 2).
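The weighted sum described above can be illustrated with a short sketch. The function name and the example effect sizes and dosages are hypothetical; real imputed dosages may take fractional values between 0 and 2.

```python
def polygenic_score(betas, dosages):
    """Weighted sum of effect-allele dosages for one individual.

    betas[i] is the published effect size of SNP i and dosages[i] is the
    individual's effect-allele dosage at that SNP.
    """
    if len(betas) != len(dosages):
        raise ValueError("betas and dosages must have the same length")
    return sum(beta * dosage for beta, dosage in zip(betas, dosages))


# Hypothetical three-SNP score for a single individual.
betas = [0.10, -0.05, 0.20]
dosages = [2, 1, 0]
score = polygenic_score(betas, dosages)
```

For this example the score is 0.10 × 2 − 0.05 × 1 + 0.20 × 0 = 0.15, up to floating-point rounding.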
We applied genetic quality control (QC) pipelines for both SNPs and samples. During SNP QC, we removed ambiguous SNPs (A/T or C/G SNPs with MAF > 0.49) and rare variants with MAF < 0.005; we only retained SNPs with high imputation quality (imputation information score > 0.4) (Supplementary Table 1). During sample QC, we excluded participants who were sex-discordant, outliers for missingness or heterozygosity, or related at 3rd degree or higher, using UKB Data Field 22020.
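The SNP-level filters just described could be expressed roughly as follows. This is a sketch under stated assumptions: the `Snp` record, its field names and the example variants are hypothetical, and only the three thresholds quoted in the text (MAF > 0.49 for ambiguous SNPs, MAF < 0.005, information score > 0.4) come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    rsid: str
    effect_allele: str
    other_allele: str
    maf: float   # minor allele frequency
    info: float  # imputation information score


# Strand-ambiguous allele pairs: A/T and C/G.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def passes_qc(snp, ambiguous_maf=0.49, min_maf=0.005, min_info=0.4):
    """Apply the three SNP filters described in the text."""
    alleles = frozenset((snp.effect_allele, snp.other_allele))
    if alleles in AMBIGUOUS_PAIRS and snp.maf > ambiguous_maf:
        return False  # ambiguous SNP with MAF too close to 0.5
    if snp.maf < min_maf:
        return False  # rare variant
    return snp.info > min_info  # keep only well-imputed SNPs


# Hypothetical variants exercising each rule.
snps = [
    Snp("rs_a", "A", "T", 0.495, 0.90),  # ambiguous, MAF > 0.49
    Snp("rs_b", "A", "G", 0.001, 0.99),  # rare
    Snp("rs_c", "C", "T", 0.200, 0.30),  # poorly imputed
    Snp("rs_d", "A", "T", 0.200, 0.95),  # A/T but MAF well below 0.49
    Snp("rs_e", "C", "G", 0.300, 0.80),
]
kept = [s.rsid for s in snps if passes_qc(s)]
```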
We then weighted the SNPs that passed our QC using the published effect sizes provided from the source paper for each score, given either in their supplementary materials or made available in the PGS Catalog, to compute PRS for those within the study population for each of the three disease outcomes.
Quantifying the stability of PRS. In each disease-specific study population, we first calculated the correlation coefficient between each pair of PRS. We then computed the age- and sex-adjusted odds ratios (ORs) at various cut-points (e.g. top 1% or top 5% of the PRS), with the middle quintile of the PRS being the reference group.
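To make the cut-point comparison concrete, the sketch below computes a crude (unadjusted) odds ratio for the top fraction of a score against its middle quintile from simple counts. This is a deliberate simplification: the ORs reported in the study are adjusted for covariates in a regression model, which this example does not attempt. The function name and the toy data are hypothetical.

```python
def crude_odds_ratio(scores, cases, top_fraction=0.01):
    """Unadjusted OR of disease: top `top_fraction` of the score versus
    the middle quintile (40th to 60th percentile)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # ascending by score
    top_n = max(1, round(n * top_fraction))
    top = order[n - top_n:]
    middle = order[n * 2 // 5: n * 3 // 5]

    def odds(indices):
        n_cases = sum(cases[i] for i in indices)
        n_controls = len(indices) - n_cases
        return n_cases / n_controls  # assumes at least one control

    return odds(top) / odds(middle)


# Toy data: 100 individuals with scores 0..99.
scores = list(range(100))
case_ids = {90, 91, 92, 93, 94, 40, 41, 42, 43}
cases = [1 if i in case_ids else 0 for i in range(100)]
# Top 10%: 5 cases, 5 controls. Middle quintile: 4 cases, 16 controls.
odds_ratio = crude_odds_ratio(scores, cases, top_fraction=0.10)
```

In the toy data the top decile has odds 5/5 = 1 and the middle quintile has odds 4/16 = 0.25, giving a crude OR of 4.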
We computed two versions of the AUC, obtained from two separate logistic regression models for each continuous PRS:
1. Crude-AUC, where only the PRS was fitted, adjusting for genetic array and the first 5 principal components (PCs) of ancestry (UK Biobank Field 22009), to reflect the predictive performance of the PRS itself.
2. Multivariable adjusted AUC (Multi-AUC), where the model was further adjusted for age and sex (if applicable).
The outcome of these logistic regression models is the disease status (Yes/No), for each of the three diseases. The Crude-AUC measures the predictive ability of the PRS on its own, while the Multi-AUC measures this after having taken account of age and sex.
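For readers less familiar with the AUC, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen case has a higher predicted value than a randomly chosen non-case, with ties counted as one half. In the study the inputs would be the fitted values from the logistic regression models; here the function name and the four example values are hypothetical, and the quadratic-time loop is chosen for clarity rather than efficiency.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise case/control comparisons."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5  # ties count as half
    return wins / (len(cases) * len(controls))


# Hypothetical predicted risks for two non-cases and two cases.
predicted = [0.10, 0.40, 0.35, 0.80]
disease = [0, 0, 1, 1]
example_auc = auc(predicted, disease)
```

Three of the four case/control pairs are ordered correctly in this example, so the AUC is 0.75.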
The continuous net reclassification index (NRI) was used to compare PRS-A with PRS-B in multivariable logistic models (Table 2). The categorical NRI was used in the cross-classification of PRS percentile risk categories. Percentage reclassifications for participants who experienced the outcome are shown in Supplementary Tables 5-7 and 11-13, for the top 1% and top 5% risk categorisations, respectively.
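The continuous NRI can be sketched from its standard definition: among participants with the event, the proportion whose predicted risk rises under the updated model minus the proportion whose risk falls, plus the corresponding net downward movement among participants without the event. The function name and the six example risks are hypothetical and are not drawn from the study's data.

```python
def continuous_nri(old_risk, new_risk, events):
    """Continuous (category-free) net reclassification index."""
    up_event = down_event = n_event = 0
    up_nonevent = down_nonevent = n_nonevent = 0
    for old, new, event in zip(old_risk, new_risk, events):
        if event:
            n_event += 1
            up_event += new > old
            down_event += new < old
        else:
            n_nonevent += 1
            up_nonevent += new > old
            down_nonevent += new < old
    event_nri = (up_event - down_event) / n_event
    nonevent_nri = (down_nonevent - up_nonevent) / n_nonevent
    return event_nri + nonevent_nri


# Hypothetical predicted risks from an "old" and an "updated" model.
events = [1, 1, 1, 0, 0, 0]
old = [0.2, 0.3, 0.4, 0.2, 0.3, 0.4]
new = [0.3, 0.4, 0.3, 0.1, 0.2, 0.5]
nri = continuous_nri(old, new, events)
```

Here two of three events move up and one moves down, and two of three non-events move down and one moves up, so each component is 1/3 and the NRI is 2/3.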
The 95% confidence intervals (CIs) of the AUC and NRI were estimated using 1,000 bootstrap replications.
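A percentile bootstrap for the AUC might look like the sketch below. The text does not state which bootstrap variant was used, so the percentile method, the resampling of individuals with replacement, the function names and the simulated data are all assumptions made for illustration.

```python
import random


def auc(scores, labels):
    """Pairwise AUC; ties count as one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def bootstrap_auc_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    n = len(scores)
    replicates = []
    while len(replicates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # need at least one case and one control
            replicates.append(auc([scores[i] for i in idx], ys))
    replicates.sort()
    lower = replicates[int((alpha / 2) * n_boot)]
    upper = replicates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Simulated data: cases have a slightly higher mean score than controls.
data_rng = random.Random(1)
labels = [0] * 20 + [1] * 20
scores = [data_rng.gauss(0.5 * y, 1.0) for y in labels]
ci_lower, ci_upper = bootstrap_auc_ci(scores, labels, n_boot=1000)
```

Fixing the seed makes the interval reproducible; a different seed gives a slightly different interval.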
Ethics approval and consent to participate. The

Consent for publication. Yes.
Transparency statement. The lead author affirms that this manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned have been explained.

Results
Our study populations were N = 171,490 (incident cases = 6,347) for breast cancer, N = 317,581 (prevalent cases = 137,649) for hypertension, and N = 335,689 (incident cases = 4,460) for dementia. In comparing different PRS, we focused on two aspects. First, we examined the consistency of the selected SNPs and of the performance metrics. Second, we assessed the correlation between each pair of PRS, and the extent to which PRS-B gave the same predictions for individuals as PRS-A.
We found that less than 80% of SNPs in PRS-A were represented in PRS-B for all three diseases, after taking LD into account (R² > 0.8). This is somewhat surprising, as one might expect a newer score (PRS-B) to incorporate most of the previously identified SNPs from PRS-A.
Table 2 presents the performance characteristics of each PRS against the corresponding disease outcome in UKB. In each case the more recent PRS-B was associated with a slightly higher OR than the earlier PRS-A. For example, the OR of breast cancer among women in the top 1% compared to those in the middle quintile
The Crude-AUC of the PRS was modest, within 0.55-0.64. After including age and sex in the multivariable model, we observed the expected increase in AUC. Despite similar Multi-AUC, PRS-A and PRS-B were not very highly correlated for any outcome, with Pearson correlation coefficients r only in the range of 0.51 to 0.65. Multi-AUC is noticeably higher than Crude-AUC for dementia, whereas the improvement is less prominent for breast cancer and hypertension. A likely explanation is that age is a stronger risk factor for dementia than for breast cancer or hypertension.
Table 2 also shows the NRI, with PRS-B being the updated model. The positive NRI indicates that PRS-B is better at correctly assigning people to the appropriate risk categories. The small positive NRI is in line with the slight increase in the AUC.
Consistent with these correlation coefficients, there was substantial reclassification of predicted risk according to percentiles of PRS-A and PRS-B for all three diseases. Table 3 shows that for women in the top 1% of breast cancer risk by PRS-A, only 23.1% were in the top 1% of risk by PRS-B. The equivalent percentages were 22.9% and 22.7% for hypertension and dementia, respectively (Supplementary Tables 3-4). We focused on the top 1% of risk because of the widely promulgated concept that these risks approximate those for monogenic traits 4.
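The kind of cross-classification summarised above can be sketched as follows: for two scores computed on the same individuals, find the top fraction by each score and report what proportion of the first group also falls in the second. The function names, the dictionary layout and the ten example individuals are hypothetical.

```python
def top_fraction_ids(scores, fraction):
    """IDs of individuals in the top `fraction` of a score.

    `scores` maps an individual's ID to their PRS value.
    """
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def top_concordance(scores_a, scores_b, fraction=0.01):
    """Proportion of the top group by score A that is also in the top group by score B."""
    top_a = top_fraction_ids(scores_a, fraction)
    top_b = top_fraction_ids(scores_b, fraction)
    return len(top_a & top_b) / len(top_a)


# Hypothetical scores for ten individuals, compared at the top 20%.
scores_a = {i: float(i) for i in range(10)}  # top two: individuals 9 and 8
scores_b = {0: 0.10, 1: 0.20, 2: 0.30, 3: 0.95, 4: 0.40,
            5: 0.50, 6: 0.60, 7: 0.70, 8: 0.65, 9: 0.99}  # top two: 9 and 3
shared = top_concordance(scores_a, scores_b, fraction=0.2)
```

Only individual 9 is in the top 20% by both scores, so the concordance in this example is 0.5. Ties at the cut-point are broken by sort order in this sketch.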
For participants in the top 5% (a more relaxed risk category than the top 1%) of risk for breast cancer by PRS-A, only 35.7% were in the top 5% of risk by PRS-B. The equivalent percentages were 35.8% and 40.0% for hypertension and dementia, respectively (Supplementary Tables 8-10).
Table 2. PRS compared for each outcome and their performance characteristics in the UK Biobank. N: number of participants whose PRS was obtained. nSNPs: number of SNPs in the PRS prior to genetic quality control. OR: odds ratio for the top 1% versus the middle quintile of PRS, from a multivariable logistic regression model adjusted for age, sex, genotyping array and first 5 PCs. AUC: area under the receiver operating characteristic curve. Crude-AUC: only the continuous PRS was fitted in a regression model. Multi-AUC: the continuous PRS was fitted, further adjusted for age and sex. NRI: continuous net reclassification index obtained from predicted risks by two multivariable logistic regression models that contain age, sex, the continuous PRS for this disease, genotyping array and first 5 PCs. The model containing PRS-B is considered the "updated" model. r: Pearson correlation coefficient between the two continuous PRS for this disease. LD: number (%) of SNPs in PRS-A which either appear in or are in linkage disequilibrium (R² > 0.8) with SNPs in PRS-B. Breast cancer models are not adjusted for sex because the population is restricted to females. 95% CIs for AUC and NRI were calculated by bootstrapping.

Discussion
The clinical utility of PRS depends on the clinical validity of the predictions. Clinical validity is not only dependent on the information PRS provides on the risk of future events, but also on the stability of these estimates.
Here, we demonstrate that for three common conditions (breast cancer, hypertension and dementia), the risk estimates derived from different PRS would result in very different information on risk of future disease being provided to a high proportion of individuals. Choice of the PRS may also influence the use of PRS as covariates or effect modifiers in epidemiologic analyses. We found that the more recent PRS resulted in minimal increases in AUC compared to the older PRS, in line with the small improvement measured by NRI. However, the PRS differed substantially in how they assigned participants to risk categories, with a substantial proportion of individuals classified at very high risk by one PRS not so classified by the other. This suggests a major potential problem for the use of these PRS in clinical practice, given the changes in clinical recommendations associated with labelling a person in the same category of risk as a monogenic disorder. Our results demonstrated large differences across all percentiles of risk; although the clinical consequences at the lower percentiles may not be as extreme as at the higher percentiles, the clinical utility will still be reduced by incorrect classification.
The continual growth of large GWAS studies means researchers are more likely to encounter the issue of inter-cohort sample overlap between derivation and target data sets, which may artificially increase the concordance of PRS 25. This was indeed our experience when selecting appropriate PRS. We were able to confirm that most of our selected PRS do not have the sample overlap issue, but not PRS-A for SBP (detail in "Selection of PRS" section). The reason is that the UKB represents by far the largest single study contributing to PRS for blood pressure, and is unlikely to be excluded from any attempt to derive a clinically applicable PRS for blood pressure.
We anticipate that fellow researchers will encounter the same obstacle, and would like to highlight this challenge to the research community.
The sample overlap between the base GWAS (i.e. derivation cohort for deriving PRS) and UKB (i.e. target cohort for calculating PRS) in our PRS-A for SBP is a limitation of our study. As a result, our results on PRS for SBP should be interpreted with the awareness of potential bias introduced by the sample overlap in PRS-A for SBP.
Recent methodological developments addressing sample overlap include the EraSOR (Erase Sample Overlap and Relatedness) method 25, which offers a potential direction for further research to clarify the roles of sample overlap and different statistical methods in the genesis of PRS discordance.
The portability of PRS depends on the characteristics (such as socio-economic status, age or sex) of the individuals in the base GWAS studies, as well as on the GWAS design, even within a single ancestry group 26. A difference in study characteristics between the PRS derivation cohort and a target cohort could contribute to the disagreement among different PRS that we have observed in this study. The choice of GWAS sample makes implicit assumptions on sample characteristics that may not hold for the prediction set (i.e. target cohort) 26. An obvious example is different ethnic backgrounds, but we minimized this by restricting to the White British subset of the UKB. Our study further supports the importance of providing study characteristics along with the PRS.
We note that we have not established the reasons for the extent of misclassification between different PRS. It does not appear to be attributable solely to the number of SNPs included in the PRS. We show this for PRS comprising over 100 thousand versus several hundred SNPs (breast cancer), for PRS composed of hundreds of SNPs (SBP), and for PRS composed of approximately 50 SNPs (dementia). The surprisingly small number of SNPs held in common by different PRS for the same condition published only a year apart indicates that the different analytical methods used to derive the PRS account for some of the discrepancies in classification; understanding this phenomenon is clearly important for PRS selection in broad clinical practice. The correlations we observed between PRS for the same condition are in the same range as is seen for variation in risk predictors measured several years apart, such as blood pressure and serum cholesterol, far from the "fixed" or "one-time" value at birth that is often assumed for PRS. This phenomenon has been previously described. For instance, Läll 27 compared the performance of four PRS in breast cancer prediction, noting that some of the correlations were as low as r = 0.3, and observed that a "metaGRS" of the PRS performed better than any of the individual PRS 28. After completing our analysis and submitting this paper, we found a preprint 29 with similar findings for two different phenotypes (cardiovascular disease and educational attainment) in the UKB, which supports the generalizability of our results. However, this issue does not seem to be widely appreciated, and most publications comparing a new PRS with previous versions assert the superiority of the new PRS and do not address the issue of misclassification of risk between PRS.
Our observations show that two PRS that differ only minimally in predictive performance at a population level may differ substantially in individual risk classification, even among individuals with the same continental ancestry. This issue requires careful consideration before utilising PRS in real-world settings, because such an arbitrary element in health care is obviously undesirable. An individual's genetic profile is generally considered fixed at birth, leading to the widely held conviction that genetic susceptibility is an immutable value; however, our findings show that methodologic decisions in the construction of the PRS can lead to meaningful differences in PRS derived from the genome. Although it is reasonable to expect incremental improvements in any risk prediction algorithm over time, these results suggest there is still considerable uncertainty associated with estimates of risk derived from different PRS for the same disease. It will be important to develop guidelines on best practice in constructing PRS to minimize the extent to which people could be given inaccurate or contradictory information over short periods of time.
Table 3. Cross-classification of predicted risk of breast cancer among the whole study population, according to the percentiles of each PRS. Numbers of participants are shown as n (col%, row%, cell%). Higher percentiles of PRS indicate increased risk of breast cancer; the "≥ 99%" percentile corresponds to the top 1% risk. Table axes: percentiles of PRS-A (%) and percentiles of PRS-B.

Data availability
Further summary data can be found in the Supplementary Materials; the authors are happy to provide further information upon the request of individual members of the public. Please note that the UK Biobank does not permit researchers to provide the raw data reported in this paper. However, interested readers are able to request the raw data via application directly to the UK Biobank (https://www.ukbiobank.ac.uk). The analytical script in R is available at https://github.com/2cjenn/AgrPRS.